{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# kNN vs. SVM\n",
    "\n",
    "A very common workflow is to index some data based on its embeddings and then given a new query embedding retrieve the most similar examples with k-Nearest Neighbor search. For example, you can imagine embedding a large collection of papers by their abstracts and then given a new paper of interest retrieve the most similar papers to it.\n",
    "\n",
    "TLDR in my experience it ~always works better to use an SVM instead of kNN, if you can afford the slight computational hit. Example below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "np.random.seed(42)\n",
    "\n",
    "embeddings = np.random.randn(1000, 1536) # 1000 documents, 1536-dimensional embeddings\n",
    "embeddings = embeddings / np.sqrt((embeddings**2).sum(1, keepdims=True)) # L2 normalize the rows, as is common\n",
    "\n",
    "query = np.random.randn(1536) # the query vector\n",
    "query = query / np.sqrt((query**2).sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "top 10 results:\n",
      "row 545, similarity 0.07956628031855817\n",
      "row 790, similarity 0.0710937236589117\n",
      "row 973, similarity 0.0692079948121463\n",
      "row 597, similarity 0.0647482457550396\n",
      "row 479, similarity 0.06350781255023308\n",
      "row 229, similarity 0.061432183499702385\n",
      "row 976, similarity 0.06122285352624162\n",
      "row 568, similarity 0.06088872280511322\n",
      "row 800, similarity 0.06007081261453451\n",
      "row 654, similarity 0.05815882432824042\n"
     ]
    }
   ],
   "source": [
    "# Tired: use kNN\n",
    "similarities = embeddings.dot(query)\n",
    "sorted_ix = np.argsort(-similarities)\n",
    "print(\"top 10 results:\")\n",
    "for k in sorted_ix[:10]:\n",
    "  print(f\"row {k}, similarity {similarities[k]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "top 10 results:\n",
      "row 0, similarity 0.9797112617216354\n",
      "row 546, similarity -0.8360649738915676\n",
      "row 791, similarity -0.8519226181122039\n",
      "row 974, similarity -0.8585435504683989\n",
      "row 480, similarity -0.8620392370633865\n",
      "row 598, similarity -0.8653315003700208\n",
      "row 230, similarity -0.8671983886478063\n",
      "row 569, similarity -0.8674761579346136\n",
      "row 977, similarity -0.8705646065664835\n",
      "row 801, similarity -0.8728033782558366\n"
     ]
    }
   ],
   "source": [
    "# Wired: use an SVM\n",
    "from sklearn import svm\n",
    "\n",
    "# create the \"Dataset\"\n",
    "x = np.concatenate([query[None,...], embeddings]) # x is (1001, 1536) array, with query now as the first row\n",
    "y = np.zeros(1001)\n",
    "y[0] = 1 # we have a single positive example, mark it as such\n",
    "\n",
    "# train our (Exemplar) SVM\n",
    "# docs: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html\n",
    "clf = svm.LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-6, C=0.1)\n",
    "clf.fit(x, y) # train\n",
    "\n",
    "# infer on whatever data you wish, e.g. the original data\n",
    "similarities = clf.decision_function(x)\n",
    "sorted_ix = np.argsort(-similarities)\n",
    "print(\"top 10 results:\")\n",
    "for k in sorted_ix[:10]:\n",
    "  print(f\"row {k}, similarity {similarities[k]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice you will find that this ordering:\n",
    "\n",
    "- is of higher quality\n",
    "- is slower: we have to train an SVM\n",
    "- can easily accommodate a number of positives not just one, so it is more flexible\n",
    "- don't be scared of having a single positive and everything else being negative. this is totally fine!\n",
    "- if you have way way too many negatives, consider subsampling and only using a portion of them.\n",
    "\n",
    "**Value of C**: You'll want to tune C. You'll most likely find the best setting to be between 0.01 and 10. Values like 10 very severely penalize the classifier for any mispredictions on your data. It will make sure to fit your data. Values like 0.01 will incur less penalty and will be more regularized. Usually this is what you want. I find that in practice a value like 0.1 works well if you only have a few examples that you don't trust too much. If you have more examples and they are very noise-free, try more like 1.0\n",
    "\n",
    "**Why does this work?** In simple terms, because SVM considers the entire cloud of data as it optimizes for the hyperplane that \"pulls apart\" your positives from negatives. In comparison, the kNN approach doesn't consider the global manifold structure of your entire dataset and \"values\" every dimension equally. The SVM basically finds the way that your positive example is unique in the dataset, and then only considers its unique qualities when ranking all the other examples.\n",
    "\n",
    "Ok cool try it out."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
